Abstract

Cloud computing is a new paradigm where data is stored across multiple servers and the
goal is to compute a function of all the data. We consider a simple model where each server uses
polynomial time and space, but communication among servers, being more expensive, is ideally
bounded by a polylogarithmic function of the input size. We dub algorithms that satisfy
these types of resource bounds nimble.

The main contribution of the paper is to develop nimble algorithms for several areas which involve
massive data and for that reason have been extensively studied in the context of Streaming
Algorithms. The areas are approximation of Frequency Moments, Counting bipartite homomorphisms
(number of copies of a fixed bipartite graph H in a graph G), Rank-k approximation
to a matrix, and Clustering. For frequency moments, we use a new importance sampling
technique based on high powers of the frequencies. We reduce the problem of counting homomorphisms
to estimating implicitly defined frequency moments. For rank-k approximations,
besides recent results of several authors developed in the Streaming context, we use a variant of
the random projection method. For clustering, we use our rank-k approximation and the small
coreset of Chen [15] of size at most polynomial in the dimension.

In contrast to our algorithms in the cloud computing model, in the streaming model, known
lower bound results for frequency moments and rank-k approximations rule out the existence of
algorithms that use polylogarithmic space.
1 Introduction

Cloud Computing is a new paradigm for storage and processing of massive data. The first objective 
of this paper is to formulate a clean high-level model of Cloud Computing. The bulk of the paper 
develops algorithms in this model. In addition to time and space, we measure communication 
as a critical resource for cloud computing algorithms. While there have been several models of 
Parallel and Distributed Computing, the Streaming model is perhaps the closest in spirit [31] [12].
Surprisingly, we find that natural problems, such as the computation of frequency moments and
low-rank matrix approximation, are feasible in this model while they are known to be provably
infeasible in the Streaming Model. 

In the frequency moment problem, the data consists of updates to the counts of elements, stored 
on many servers. 

Problem 1.1 (Frequency Moments) A nonnegative n-vector of frequencies of n distinct elements,
$f = (f_1, f_2, \ldots, f_n)$, is represented by a sequence of updates, each of the form $(i, x)$, which
indicates "increment $f_i$ by $x$" for some $x > 0$. The objective is to compute the k'th moment of f,
namely, $\sum_{i=1}^{n} f_i^k$, where k is a positive integer.
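
As a point of reference, the k'th moment can be computed exactly by merging every server's counts, at the cost of communication linear in the data. The Python sketch below does this. The function names `local_counts` and `frequency_moment` and the sample updates are illustrative assumptions, not part of the paper, and this baseline is not a nimble algorithm.

```python
from collections import Counter


def local_counts(updates):
    """Sum the (i, x) updates held by one server into per-element counts."""
    counts = Counter()
    for i, x in updates:
        counts[i] += x
    return counts


def frequency_moment(server_updates, k):
    """Exact k-th moment of f, where f_i is the sum of x over all updates (i, x)."""
    total = Counter()
    for updates in server_updates:
        # Counter.update adds counts, so this merges the servers' vectors.
        total.update(local_counts(updates))
    return sum(f ** k for f in total.values())


# Three servers holding updates to elements 0, 1 and 2 (hypothetical data).
servers = [[(0, 2), (1, 1)], [(0, 1), (2, 3)], [(1, 4)]]
# Merged frequencies are f = (3, 5, 3).
second_moment = frequency_moment(servers, 2)
```

The sampling-based algorithms of Section 2 aim to approximate the same quantity while exchanging far fewer numbers.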

We also consider a related problem of counting homomorphisms, i.e., the number of copies of 
one (small) graph in another (large) graph. 

In many applications, the data is a massive matrix that has to be split across servers. One would
like to compute Linear Algebra quantities of the whole matrix, but without having to communicate
it across servers. Here, we study the following fundamental problem in this area. Given a matrix
A stored across servers, find an approximation B to A of rank at most k. The best approximation
can be found by the Singular Value Decomposition, as is well known. But here, we will be satisfied with
a near-optimal B (where "near-optimal" is in terms of relative error).

Problem 1.2 (Low-rank Approximation) Given an n x d matrix A, a positive integer k and
$\epsilon > 0$, find an n x d matrix B of rank at most k such that

$$\|A - B\|_F \le (1 + \epsilon) \cdot \min_{X : \mathrm{rank}(X) \le k} \|A - X\|_F.$$

[Here, for a matrix A, the squared Frobenius norm $\|A\|_F^2$ is the sum of squares of the entries of A.]

We also consider one of the most popular variants of Clustering based on the k-means objective
function. We will develop algorithms for these problems in the cloud setting. In contrast, as we 
state more precisely below, in the streaming model, known lower bounds rule out such algorithms. 

1.1 The Model 

Our model is simple: there are s servers, where s is generally to be thought of as a constant (but 
may be a function of the size of the problem as well) . Each server has a part of the input data of 
the problem. For example, if it is a graph problem, the servers might have disjoint subsets of edges 
of the graph. A more general example is a matrix. We consider two distinct models. 

In the row partition setting, the rows of the matrix are partitioned and each server has a
subset of (whole) rows, whereas in the arbitrary partition setting (called the "turnstile model" in
streaming), the matrix is given by a set of updates of the form $(A_{ij}, a)$, which says "increase $A_{ij}$
by $a$", where $a$ is a positive or negative real number. We assume that the updates are partitioned
arbitrarily among the servers. Since each server has polynomial time and space, it can just as well
add up all the updates for each entry $A_{ij}$, and thus server t has an n x d matrix $A_t$ such that the
whole matrix A is given as

$$A = A_1 + A_2 + \cdots + A_s.$$

We will measure two resources: (1) time taken by the servers to solve a problem and (2)
total communication among servers. Of these, we treat communication as the more expensive
resource and generally restrict it to be polylogarithmic (sometimes sublinear) in the size of the
problem, whereas we will generally allow the time taken by each processor (as well as the internal
Random Access Memory used by each processor) to be polynomially bounded in the data size.
To complete the model description, we will say that a problem has a nimble algorithm if
there is a randomized algorithm which solves the problem with initial data partitioned among the
s processors arbitrarily, using polynomial time and polynomial space in each server and a sublinear
(ideally polylogarithmic when $s \in O(1)$) amount of communication. We will generally state explicit
bounds on the resources used by each algorithm.

Such a distributed model was introduced by Cormode et al. [E], and subsequently studied by
others, including Phillips et al. [32] and Woodruff and Zhang [3l]. The latter give algorithms and
lower bounds for estimating frequency moments; we mention their results presently. This model
has also been recently considered for distributed learning problems [TJ [26] ^^. As observed in [25],
it is no weaker than streaming in the following sense: any sketching algorithm (i.e., one that can
be applied to arbitrary subsets of data and combine their outputs in arbitrary order) that uses
$O(s)$ space and p passes over the data can be implemented in the cloud with $O(sp)$ communication
and the same asymptotic time complexity.

1.2 Results 

Our first result is a nimble algorithm for estimating frequency moments. 

Theorem 1.1 For any positive integer k, the k'th frequency moment of data presented as updates
partitioned arbitrarily among s servers can be estimated to within relative error $\epsilon$ with probability
at least 3/4 using $O((2s)^k \log n/\epsilon^2)$ communication and $O(n)$ time per server.

(C\ \^^''^ 
Prior to our work. Woodruff and Zhang [M] gave an algorithm that achieves s I — ^^^^ I 

communication. 

We can contrast the above result with the streaming model. Alon, Matias and Szegedy's seminal
paper showed that frequency moments for $k \le 2$ can be computed in polylogarithmic space, and that
polynomial space is needed for k greater than a larger constant. This was improved to nearly matching
bounds for all k, with an upper bound of $O(n^{1-2/k})$ [27] and a lower bound showing that the k'th
frequency moment for $k > 2$ needs $\Omega(n^{1-2/k}/\log n)$ memory [3][8][13].

The main idea of our nimble algorithm is to sample elements from within a server according 
to higher moments. It turns out that sampling according to the squared value, which has been 
very effective for other settings, does not suffice here. There is also a nearly matching lower bound 
which is a direct consequence of the communication complexity of the multi-party set disjointness 
problem [3][13][8].

Theorem 1.2 Estimating the k'th frequency moment of a set to within a factor of $(1 + \epsilon)$ in the
cloud model with s servers needs $\Omega(s^{k-1}/(\epsilon \log k))$ communication.



We next turn to another counting problem, namely counting homomorphisms, i.e., the number
of copies of a small graph H in a large graph G, when the vertices of G (rows of its adjacency
matrix) are partitioned arbitrarily among servers. This is a natural problem with many interesting
special cases, such as counting the number of k-cycles, or of stars or cliques of a fixed size. We will
show that for a large class of graphs, the number of homomorphisms can be estimated to within relative
error $\epsilon$ by a nimble algorithm, assuming the vertices are partitioned arbitrarily among servers (i.e.,
the row partition model). We state here the result for counting complete bipartite graphs (which
includes the case of stars $K_{1,t}$ and 4-cycles).

Theorem 1.3 The number of complete bipartite subgraphs $K_{r,t}$ in a given graph $G = (V, E)$ can
be estimated to within a factor of $(1 + \epsilon)$ by a nimble algorithm.

As we will see in Section 3, we can in fact count the number of bipartite subgraphs in which each
vertex on one side must have its degree belonging to some given set of integers, e.g., the number
of bipartite subgraphs $K_{r,t}$ with degree on the left at least t/2 (i.e., the degrees are constrained to
the set $S = \{\lceil t/2 \rceil, \lceil t/2 \rceil + 1, \ldots, t\}$). We note that we cannot approximately count the number of
cliques of size r (even triangles) with polylogarithmic communication; this is perhaps not surprising
in the light of nearly linear lower bounds in the streaming model [31] [12].

Our next set of results is for low-rank approximation. We begin with the row partition model.

Theorem 1.4 Suppose the rows of the input n x d matrix A are partitioned among s servers
arbitrarily with an $n_t \times d$ matrix $A_t$ in server t. For any $1 > \epsilon > 0$, there is a nimble algorithm
that, on termination, leaves an $n_t \times d$ matrix $C_t$ in server t such that the matrix C formed by all
the $C_t$'s achieves

$$\|A - C\|_F \le (3 + \epsilon) \min_{X : \mathrm{rank}(X) \le k} \|A - X\|_F$$

using linear space, polynomial time and with total communication bounded by $O(sk/\epsilon)$ rows of A
and $O(sk^2/\epsilon^2)$ additional real numbers.

At the heart of the algorithm is a procedure to approximate the top k singular vectors with low
communication. In Section 4, we first develop simpler algorithms with somewhat higher communication
bounds of $O(sd^2)$ to solve the problem exactly (Theorem 4.2), and $O(skd)$ to get a factor 3
approximation (Theorem 4.3), before indicating a proof of Theorem 1.4. Note that if $s, k, d \in O(1)$,
then the communication is polylogarithmic. The guarantee above is stronger for matrices whose
rows are sparse.

Our next result is for the arbitrary partition model. 

Theorem 1.5 Consider the arbitrary partition model where an n x d matrix $A_t$ resides in server t
and the data matrix $A = A_1 + A_2 + \cdots + A_s$. For any $1 > \epsilon > 0$, there is a nimble algorithm that,
on termination, leaves an n x d matrix $C_t$ in server t such that the matrix $C = C_1 + C_2 + \cdots + C_s$
with high probability achieves

$$\|A - C\|_F \le (1 + \epsilon) \min_{X : \mathrm{rank}(X) \le k} \|A - X\|_F$$

using linear space, polynomial time and with total communication complexity $O(sd^2/\epsilon^2)$.



The hidden constants in our $\Omega$, O notation are all independent of n, k, d.



This result uses an r x n pseudo-random projection matrix P. Alon, Matias and Szegedy [3] use
P with full independence among rows, but with only O(1)-way independence within a row to save
space; they show that for one fixed vector $x \in \mathbb{R}^n$, $\|Px\|^2$ estimates $\|x\|^2$ to within relative error. Here, we will
use P with O(d)-way independence within a row and full independence among rows with $r \in O(d)$.
We will prove that with high probability, simultaneously for all $x \in \mathbb{R}^d$, $\|PAx\|^2$ estimates
$\|Ax\|^2$.

Finally, we consider the problem of Clustering multi-dimensional data. This problem has received
much attention in the traditional computational settings (polynomial-time approximation
algorithms [23] [T^ [6] [30] [15] [17]). More to the point here, it is an important problem for many
modern large data sets and has therefore been considered extensively in the setting of streaming
algorithms as well as, more recently, in parallel and distributed machine learning settings (see e.g.,
Chapters 3, 4, 5 in [9]).

We now define the clustering problem precisely. Data points are rows of an n x d matrix A. The
rows of A are partitioned among s processors with the t'th processor having $n_t$ rows which form
an $n_t \times d$ matrix $A_t$. A k-clustering is defined by an n x d matrix C of centers, where each row of
C is one of the k centers and the i'th row of C is the cluster center closest to the i'th data point,
namely, the i'th row of A.

One approach to clustering is to use a stochastic model of data, say a mixture of k Gaussians, 
with the assumption that the means of the Gaussians are well-separated so that the clusters are 
distinct and then devise algorithms to find the clusters. The separation assumed could be roughly 
stated as "the means of any two different Gaussians are at least $\Omega_k(1)$ standard deviations apart".
For general Gaussians, the variance could be different in different directions; so one takes the 
maximum variance in any direction. 

We do not assume any stochastic model of data. Following [30] [17], we can define an analog of
standard deviation: for a clustering C, define $\sigma(C)$ by

$$\sigma(C)^2 = \frac{1}{n} \max_{\|v\| = 1} \|(A - C)v\|^2.$$

In words, $\sigma(C)^2$ is the maximum average squared distance of any data point to its center along a
direction v, the maximum taken over all directions v. Clearly, $\sigma(C)$ is just $1/\sqrt{n}$ times the spectral
norm of $A - C$.
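
A minimal sketch of how $\sigma(C)^2$ could be evaluated numerically, assuming the whole matrices A and C are available in one place and using power iteration on $(A - C)^T (A - C)$ as a stand-in for an exact spectral-norm routine. The function name `sigma_squared`, the iteration count and the example data are assumptions made for illustration; power iteration also assumes the starting vector is not orthogonal to the top direction.

```python
import math


def sigma_squared(A, C, iters=200):
    """(1/n) * max over unit v of ||(A - C) v||^2, via power iteration."""
    n, d = len(A), len(A[0])
    M = [[A[i][j] - C[i][j] for j in range(d)] for i in range(n)]
    # G = M^T M is d x d; its top eigenvalue is the squared spectral norm of M.
    G = [[sum(M[i][a] * M[i][b] for i in range(n)) for b in range(d)]
         for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(G[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # A == C, so every direction has zero spread
        v = [x / norm for x in w]
    # Rayleigh quotient v^T G v approximates the top eigenvalue.
    top = sum(v[a] * sum(G[a][b] * v[b] for b in range(d)) for a in range(d))
    return top / n


# Four points in the plane and the centre assigned to each point (hypothetical data).
A = [[1, 0], [3, 0], [0, 5], [0, 9]]
C = [[2, 0], [2, 0], [0, 7], [0, 7]]
value = sigma_squared(A, C)
```

Here $A - C$ has squared column norms 2 and 8, so the maximizing direction is the second coordinate axis.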

Definition 1.1 A clustering C is said to be proper if every pair of centers of C is at least $c_0(k)\sigma(C)$
apart (where $c_0(k)$ is a function of k alone).

Definition 1.2 Any k-clustering $C'$ with the property that to every center $\mu$ of C, there is a center
$\nu$ of $C'$ with $|\mu - \nu| \le c_1(k)\sigma(C)$, is said to be a valid approximation to C.

The reason for this definition is that if there is a proper clustering C and we find a valid
approximation $C'$ to C, then it is not difficult to show that C and $C'$ differ only in a small fraction
of the data points (with a suitable choice of $c_0, c_1$). Can such a valid clustering be computed by a
nimble algorithm? Surprisingly, the answer is yes, in the row partition model.

Theorem 1.6 Suppose the rows of an nx d matrix A are the data points to be k-clustered and the 
rows of A are partitioned among s servers. Assume that there is a proper k-clustering C. There is 
a nimble algorithm which finds a valid approximation to C . The algorithm takes polynomial time 
and uses 0{d^ + /c^) total communication for s = 0(1) servers. 



2 Estimating frequency moments 

Let $f_{ij}$ denote the frequency of the i'th element in the j'th server (i.e., the sum of all updates to
$f_i$ on the j'th server) when there are n distinct elements and s servers. Then the k'th frequency
moment of the data is

$$\sum_{i=1}^{n} \Big( \sum_{j=1}^{s} f_{ij} \Big)^{k}.$$

A fundamental problem is to estimate frequency moments efficiently.

2.1 Two servers 

To warm up, we consider the problem of estimating the third moment when data is split between two
servers. Let $u_i, v_i$ denote the frequencies of the i'th element in the two servers, so that $f_i = u_i + v_i$.
Then we can use the following algorithm:

1. The two servers independently compute $\sum_i u_i^3$ and $\sum_i v_i^3$.

2. The first server samples an i.i.d. subset S according to the distribution that samples j with
probability

$$p_j = \frac{u_j^3}{\sum_{i=1}^{n} u_i^3}$$

and announces their frequencies in its data subset. The second server computes

$$A = \frac{3}{|S|} \sum_{j \in S} \frac{u_j^2 v_j}{p_j}$$

and announces it.

3. The servers reverse their roles and estimate $B = \frac{3}{|S|} \sum_{j \in S} u_j v_j^2 / q_j$, where now the sample S is
drawn by the second server, according to q proportional to $v^3$.

4. The final estimate is $A + B + \sum_i (u_i^3 + v_i^3)$.



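Lemma 2.1 Let X be a random variable set to $u_j^2 v_j / p_j$ with probability $p_j = u_j^3 / \sum_{i=1}^{n} u_i^3$. Then

$$E(X) = \sum_{i=1}^{n} u_i^2 v_i \quad \text{and} \quad \mathrm{Var}(X) \le \Big( \sum_{i=1}^{n} (u_i^3 + v_i^3) \Big)^2.$$

Proof. We bound the variance by the second moment. Let $J = \{i : u_i, v_i > 0\}$.

$$\mathrm{Var}(X) \le \sum_{i \in J} \frac{u_i^4 v_i^2}{p_i} = \Big( \sum_{i=1}^{n} u_i^3 \Big) \Big( \sum_{i \in J} u_i v_i^2 \Big) \le \Big( \sum_{i=1}^{n} u_i^3 \Big) \Big( \sum_{i=1}^{n} (u_i^3 + v_i^3) \Big) \le \Big( \sum_{i=1}^{n} (u_i^3 + v_i^3) \Big)^2. \qquad \Box$$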

From the lemma it follows that we can estimate the sum of the two mixed terms in the expansion
of $(u_i + v_i)^3$ to within $\epsilon$ times the third moment using $O(1/\epsilon^2)$ samples. Thus, the third moment
of element frequencies can be estimated in the cloud with two servers to within $1 + \epsilon$ relative error
with probability at least 3/4 using $O(\log n/\epsilon^2)$ communication and $O(n)$ time.

Next we estimate the k'th moment by extending the above algorithm.

1. The two servers separately compute $\sum_i u_i^k$ and $\sum_i v_i^k$.

2. The first server samples an i.i.d. subset S according to the distribution that samples j with
probability

$$p_j = \frac{u_j^k}{\sum_{i=1}^{n} u_i^k}$$

and announces the frequencies of the sampled elements in its data subset. The second server
computes

$$A = \sum_{r = \lceil k/2 \rceil}^{k-1} \binom{k}{r} \frac{1}{|S|} \sum_{j \in S} \frac{u_j^{r} v_j^{k-r}}{p_j}$$

and announces it.

3. The servers reverse their roles and estimate

$$B = \sum_{r = \lfloor k/2 \rfloor + 1}^{k-1} \binom{k}{r} \frac{1}{|S|} \sum_{j \in S} \frac{u_j^{k-r} v_j^{r}}{q_j},$$

where now the sample S is drawn by the second server, according to q proportional to $v^k$.

4. The final estimate is $A + B + \sum_i (u_i^k + v_i^k)$.



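Lemma 2.2 Let $r \ge k/2$ and X be a random variable equal to $u_j^r v_j^{k-r} / p_j$ with probability
$p_j = u_j^k / \sum_{i=1}^{n} u_i^k$. Then

$$E(X) = \sum_{i=1}^{n} u_i^{r} v_i^{k-r} \quad \text{and} \quad \mathrm{Var}(X) \le \Big( \sum_{i=1}^{n} (u_i^k + v_i^k) \Big)^2.$$

Proof. The proof is a direct extension of the case k = 3. Let J be the subset of indices where
$u_i, v_i$ are both nonzero.

$$\mathrm{Var}(X) \le \sum_{i \in J} \frac{u_i^{2r} v_i^{2k-2r}}{p_i} = \Big( \sum_{j=1}^{n} u_j^k \Big) \Big( \sum_{i \in J} u_i^{2r-k} v_i^{2k-2r} \Big) \le \Big( \sum_{i=1}^{n} (u_i^k + v_i^k) \Big)^2.$$

We used the fact that for any $r \ge k/2$, $u_i^{2r-k} v_i^{2k-2r} \le u_i^k + v_i^k$. $\Box$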

Theorem 2.3 For two servers and any constant k, the k'th moment of element frequencies can be
estimated in the cloud to within $1 + \epsilon$ relative error with probability at least 3/4
using $O(\log n/\epsilon^2)$ communication and $O(n)$ time.
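
The two-server procedure can be simulated in a few lines of Python. This is a sketch under stated assumptions: both frequency vectors live in one process, the term ranges are $r = \lceil k/2 \rceil, \ldots, k-1$ for the first server and $r = \lfloor k/2 \rfloor + 1, \ldots, k-1$ for the second so that each mixed term is counted once, and the names `mixed_terms`, `estimate_moment` and `exact_moment` are hypothetical.

```python
import math
import random


def mixed_terms(a, b, k, r_min, samples, rng):
    """Estimate sum_{r=r_min}^{k-1} C(k, r) * sum_j a_j^r * b_j^(k-r).

    The server holding `a` samples j with probability a_j^k / sum_i a_i^k;
    the other server looks up b_j for each sampled index.
    """
    weights = [x ** k for x in a]
    total = sum(weights)
    if total == 0:
        return 0.0
    picks = rng.choices(range(len(a)), weights=weights, k=samples)
    acc = 0.0
    for j in picks:
        p_j = weights[j] / total
        for r in range(r_min, k):
            acc += math.comb(k, r) * a[j] ** r * b[j] ** (k - r) / p_j
    return acc / samples


def estimate_moment(u, v, k, samples=2000, seed=0):
    """Importance-sampling estimate of sum_i (u_i + v_i)^k."""
    rng = random.Random(seed)
    A = mixed_terms(u, v, k, math.ceil(k / 2), samples, rng)  # terms heavy in u
    B = mixed_terms(v, u, k, k // 2 + 1, samples, rng)        # terms heavy in v
    return A + B + sum(x ** k for x in u) + sum(y ** k for y in v)


def exact_moment(u, v, k):
    return sum((x + y) ** k for x, y in zip(u, v))
```

When v is a constant multiple of u, every sampled term takes the same value, so the estimate agrees with the exact moment up to floating-point rounding; this makes a convenient sanity check.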

2.2 The general setting 

Here we consider the setting of s > 2 servers. First we note that there is a lower bound of
$\Omega(s^{k-1})$ from the communication complexity of the multi-party set disjointness problem, defined
as follows: given a universe of n elements and t players each receiving a subset, solve the promise
problem: distinguish between the subsets being disjoint OR the subsets having exactly one element
in common to all of them and no other intersection between any of them.

Theorem 2.4 ([13], [8]) The communication complexity of multi-party set disjointness with n elements
and t players is $\Omega(n/(t \log t))$.

This theorem readily implies the lower bound of Theorem 1.2 by the reduction of [3].
Proof (of Theorem 1.2). The reduction from set disjointness to k'th moment estimation is
essentially the same as the one given in [3]. Each of the s servers gets m distinct elements with
frequency 1. Moreover, either the subsets of elements for each server are completely disjoint or there
is one element common to all s servers. Thus the frequency moment is either $sm$ or $s(m - 1) + s^k$.
Estimating the frequency moment to a factor less than $(s(m - 1) + s^k)/sm$ solves the set disjointness
problem and thus requires $\Omega(sm/(s \log s)) = \Omega(m/\log s)$ communication. Setting $m = 2s^{k-1}/\epsilon$
proves the theorem. $\Box$
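
The two moment values used in the reduction can be checked on a small instance. The Python sketch below builds both kinds of input; the function names `disjointness_instance` and `moment` and the element numbering are illustrative assumptions.

```python
def disjointness_instance(s, m, common):
    """Give each of s servers m distinct elements, each with frequency 1.

    If `common` is True, element 0 is shared by all servers and each server
    has m - 1 private elements; otherwise the sets are pairwise disjoint.
    """
    servers = []
    next_id = 1
    for _ in range(s):
        private = m - 1 if common else m
        elems = list(range(next_id, next_id + private))
        next_id += private
        if common:
            elems.append(0)
        servers.append(elems)
    return servers


def moment(servers, k):
    """k-th frequency moment of the union of the servers' elements."""
    freq = {}
    for elems in servers:
        for e in elems:
            freq[e] = freq.get(e, 0) + 1
    return sum(f ** k for f in freq.values())


# s = 3 servers, m = 4 elements each, second moment.
disjoint_value = moment(disjointness_instance(3, 4, False), 2)  # s * m
common_value = moment(disjointness_instance(3, 4, True), 2)     # s * (m - 1) + s ** k
```

The gap between the two values is $s^k - s$, which is what an estimator must resolve in order to decide disjointness.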

We next describe the nimble algorithm that nearly matches this lower bound. 



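Recall that the frequency of the i'th element in the j'th server is $f_{ij}$. We expand the k'th
moment:

$$\sum_{i=1}^{n} \Big( \sum_{j=1}^{s} f_{ij} \Big)^{k} = \sum_{r_1 + \cdots + r_s = k} \binom{k}{r_1, \ldots, r_s} \sum_{i=1}^{n} \prod_{j=1}^{s} f_{ij}^{r_j}.$$

We estimate each internal sum separately as follows. Consider a generic term

$$\sum_{i=1}^{n} f_{i1}^{r_1} f_{i2}^{r_2} \cdots f_{im}^{r_m} \qquad (1)$$

where $r_1, \ldots, r_m \ge 1$ and $\sum_{j=1}^{m} r_j = k$. Let $M = \max_{i,j} f_{ij}$.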
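
The expansion into generic terms can be verified by brute force on a small table of frequencies. In the Python sketch below, `f[i][j]` is the frequency of element i on server j; exponent tuples with zero entries are kept, with the corresponding factors equal to 1, rather than being relabelled onto the first m servers. All names are illustrative assumptions.

```python
import math
from itertools import product


def compositions(k, s):
    """All tuples (r_1, ..., r_s) of nonnegative integers summing to k."""
    return [r for r in product(range(k + 1), repeat=s) if sum(r) == k]


def multinomial(k, r):
    """k! / (r_1! * ... * r_s!) for a tuple r summing to k."""
    coeff = math.factorial(k)
    for r_j in r:
        coeff //= math.factorial(r_j)
    return coeff


def moment_by_expansion(f, k):
    """Sum over exponent tuples of multinomial * generic term."""
    s = len(f[0])
    total = 0
    for r in compositions(k, s):
        generic = sum(math.prod(row[j] ** r[j] for j in range(s)) for row in f)
        total += multinomial(k, r) * generic
    return total


def moment_direct(f, k):
    return sum(sum(row) ** k for row in f)


# Three elements on three servers (hypothetical frequencies).
f = [[1, 2, 0], [3, 0, 1], [2, 2, 2]]
```

For k = 3 and s = 3 there are 10 exponent tuples, which illustrates why the number of generic terms grows quickly with s and k.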

1. Each server j classifies its frequencies $f_{ij}$ into at most $\log M$ buckets, with the l'th bucket $B_l$
having indices i with $f_{ij} \in [2^{l-1}, 2^l)$.

2. Each server j does the following:

(a) For each bucket l, server j samples a set S of indices from the bucket according to
$p_i = f_{ij}^k / \sum_{t \in B_l} f_{tj}^k$ and announces them.

(b) Every other server announces the frequencies of the indices in S for which its own frequency is
less than $2^l$. Let $I = \{i \in S : f_{ij'} < 2^l \text{ for every other server } j'\}$.

(c) Server j computes

$$\frac{1}{|S|} \sum_{i \in I} \frac{\prod_{j'=1}^{m} f_{ij'}^{r_{j'}}}{p_i}.$$

3. The estimate for (1) is the sum of the computations of each server for each bucket.
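
Lemma 2.5 Define the random variable X (taking the sampling server to be server 1) as

$$X = \frac{\prod_{j=1}^{m} f_{ij}^{r_j}}{p_i} \, \chi(i \in I) \quad \text{with prob. } p_i = \frac{f_{i1}^{k}}{\sum_{t \in B_l} f_{t1}^{k}} \text{ for } i \in B_l.$$

Then

$$E(X) = \sum_{i \in I} \prod_{j=1}^{m} f_{ij}^{r_j} \quad \text{and} \quad \mathrm{Var}(X) \le 2^{2k} \Big( \sum_{i=1}^{n} f_i^k \Big)^2.$$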



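Proof. To bound the variance (of a single random sample), we use the second moment. If $2r_1 \ge k$,
then the proof is very similar to that of Lemma 2.2. Assume $2r_1 < k$.

$$\mathrm{Var}(X) \le \sum_{i \in I} \frac{\prod_{j=1}^{m} f_{ij}^{2r_j}}{p_i} = \Big( \sum_{t \in B_l} f_{t1}^{k} \Big) \sum_{i \in I} \frac{\prod_{j=2}^{m} f_{ij}^{2r_j}}{f_{i1}^{k - 2r_1}} \le \Big( \sum_{t \in B_l} f_{t1}^{k} \Big) \sum_{i \in I} \frac{(2 f_{i1})^{2k - 2r_1}}{f_{i1}^{k - 2r_1}} \le 2^{2k} \Big( \sum_{i \in B_l} f_{i1}^{k} \Big)^2. \qquad \Box$$

This lemma readily implies Theorem 1.1.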

3 Counting homomorphisms 

We consider the problem of counting the number of copies of a fixed graph H in a larger graph G,
when the vertices of G are partitioned arbitrarily among cloud servers.

The simplest example is counting the number of paths of length 2. This can be written as

$$t(K_{1,2}, G) = \sum_{i=1}^{n} \binom{d_i}{2}$$

where $d_i$ is the degree of vertex i. Expanding, we get

$$t(K_{1,2}, G) = \frac{1}{2} \sum_{i=1}^{n} d_i^2 - \frac{1}{2} \sum_{i=1}^{n} d_i.$$

Both terms are frequency moments where degree $d_i$ is the "frequency" of the i'th vertex. So we
can use the frequency moment estimation algorithm of the previous section. (Since these are only
first and second moments so far, we could also do it in the streaming model, with vertices arriving
in arbitrary order, using the approach of [3]).

Next suppose we want to count the number of stars with t leaves, i.e., $K_{1,t}$. This is

$$t(K_{1,t}, G) = \sum_{i=1}^{n} \binom{d_i}{t}$$

which is a polynomial in the first t frequency moments.
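
The path and star counts can be checked directly from the degree sequence. The Python sketch below assumes, for illustration only, that the edges are split across servers as plain edge lists; the names `degrees`, `count_stars` and `count_paths_via_moments` are hypothetical.

```python
import math


def degrees(edge_lists, n):
    """Degree of each of n vertices when the edges are split across servers."""
    deg = [0] * n
    for edges in edge_lists:
        for a, b in edges:
            deg[a] += 1
            deg[b] += 1
    return deg


def count_stars(deg, t):
    """Number of copies of K_{1,t}: choose t neighbours of each centre."""
    return sum(math.comb(d, t) for d in deg)


def count_paths_via_moments(deg):
    """Paths of length 2 from the first two moments of the degree sequence."""
    f1 = sum(deg)
    f2 = sum(d * d for d in deg)
    return (f2 - f1) // 2


# A star with centre 0 and leaves 1..4, plus the edge (1, 2), on two servers.
servers = [[(0, 1), (0, 2)], [(0, 3), (0, 4), (1, 2)]]
deg = degrees(servers, 5)  # [4, 2, 2, 1, 1]
```

Both routes give the same count of paths of length 2, since $\binom{d}{2} = (d^2 - d)/2$ for every vertex.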

To count the number of 4-cycles, let us define $d_{ij}$ to be the number of common neighbors of
vertices i and j. Then,

$$t(C_4, G) = \frac{1}{2} \sum_{i < j} \binom{d_{ij}}{2}.$$

This is again just the first two moments of the "frequencies" $d_{ij}$. We now prove its generalization
to any complete bipartite subgraph.

Proof (of Theorem 1.3). To count the number of complete bipartite subgraphs $K_{r,t}$, we define the
joint degree $d_S$ of a subset S of r vertices as the number of neighbors common to all r vertices in
S. Then

$$t(K_{r,t}, G) = \sum_{S \subseteq V, |S| = r} \binom{d_S}{t}.$$

This needs only the first t frequency moments of these set degrees. Since vertices are partitioned
in the cloud, we can keep track of these subset vertex degrees and use the frequency moment
estimation algorithm from the previous section. $\Box$
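
The joint-degree formula can be evaluated by brute force on a small graph. This Python sketch assumes the whole adjacency structure is in one place (a list of neighbour sets) and enumerates all r-subsets, so it is only a correctness check and not a nimble algorithm; the name `count_complete_bipartite` is hypothetical. Note that when r = t the sum counts each copy once per choice of side, so each 4-cycle contributes 2 to the (2, 2) count.

```python
import math
from itertools import combinations


def count_complete_bipartite(adj, r, t):
    """Sum over r-subsets S of C(d_S, t), with d_S the number of common neighbours."""
    total = 0
    for S in combinations(range(len(adj)), r):
        common = set.intersection(*(adj[v] for v in S))
        total += math.comb(len(common), t)
    return total


# K_{2,3} with left side {0, 1} and right side {2, 3, 4}.
adj = [{2, 3, 4}, {2, 3, 4}, {0, 1}, {0, 1}, {0, 1}]
k22 = count_complete_bipartite(adj, 2, 2)
k23 = count_complete_bipartite(adj, 2, 3)
```

On this graph there are three 4-cycles, matching half of the (2, 2) count, and exactly one copy of $K_{2,3}$.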

This can be generalized to counting all bipartite subgraphs that satisfy degree constraints of a 
certain type. 

Theorem 3.1 Let S be an arbitrary subset of [t]. Let $\mathcal{H}$ consist of all bipartite graphs $H = (U, V, E)$
with $r = |U|$, $t = |V|$ such that for every $u \in U$, $\deg(u) \in S$. Then the total number of copies of elements
of $\mathcal{H}$ that occur as subgraphs of a graph G, $t(\mathcal{H}, G)$, can be estimated to relative accuracy $\epsilon$ by a
nimble algorithm using $O((2s)^t \log n)$ communication and polynomial time and space on s servers.

4 Low-rank Approximations 

For a matrix A, define $f_k(A)$ as:

$$f_k(A) = \min_{X : \mathrm{rank}(X) \le k} \|A - X\|_F.$$

Recall that the rank k approximation problem is the following: Given an n x d matrix A, and
$\epsilon > 0$, find an n x d matrix B of rank at most k such that $\|A - B\|_F \le (1 + \epsilon) \cdot f_k(A)$.

We will give bounds in terms of d, k, n and $\epsilon$. An interesting range is when $d, k \in O(1)$ and
$n \to \infty$; our algorithms' communication is at most polylogarithmic in this range.

4.1 Row-Partition Model 

This model is appropriate in situations where data points are rows of an n x d matrix A. The set
of data points is partitioned (arbitrarily) among s servers. Let $n_t$ denote the number of data points
stored in server t, with $n_1 + n_2 + \cdots + n_s = n$. Let $A_t$, $t = 1, 2, \ldots, s$, denote the $n_t \times d$ matrix of
data points stored in server t. [Note: each row resides wholly in one server.]

In the Streaming (1-pass) model, the rank-k approximation problem cannot be solved with
polylogarithmic space, even when d, k are O(1) and the matrix is presented in row order, as shown
by Clarkson and Woodruff [16].

Theorem 4.1 (Theorem 4.10 of [16].) Suppose an n x d matrix A is input to a streaming algorithm
in row order and assume that $d \in \Omega(k)$. If the algorithm solves the Rank-k Approximation Problem
with probability of error at most 1/3, then it must use $\Omega(nk/\epsilon)$ bits of space.

In contrast, we will show that there are nimble algorithms which can solve the problem in
polynomial time and polylogarithmic communication when $d, k \in O(1)$. One caveat: Clarkson and
Woodruff's lower bounds here, as well as one presented later, require the full rank k approximation
matrix to be output and this per se requires $\Omega(nd)$ bits. But in the cloud set-up, we will not
require any one server to have the full matrix. Nevertheless, the approximation they together
compute must be valid for the whole matrix.

We will describe three algorithms. The first algorithm is very simple:

First Algorithm

• Server t computes $A_t^T A_t$ and communicates it to server 1.

• Server 1 computes $B = \sum_{t=1}^{s} A_t^T A_t$. It does the SVD of B to find its top k singular
vectors $v_1, v_2, \ldots, v_k$ and communicates these to all the servers.

• Server t computes $C_t = A_t \sum_{i=1}^{k} v_i v_i^T$.

The proof of correctness is also simple: we have for any vector v:

$$|Av|^2 = \sum_{t=1}^{s} |A_t v|^2 = v^T \Big( \sum_{t=1}^{s} A_t^T A_t \Big) v.$$

So the top k singular vectors of A are just the top k singular vectors of B, namely, $v_1, v_2, \ldots, v_k$. It
is well-known that the best rank k approximation to A is $A \sum_{i=1}^{k} v_i v_i^T$ and so in this case, in fact,
we can take $\epsilon = 0$. This is summarized by:

Theorem 4.2 Suppose the rows of the input n x d matrix A are partitioned among s servers with
an $n_t \times d$ matrix $A_t$ in server t. There is a nimble algorithm which on termination leaves an $n_t \times d$
matrix $C_t$ in server t such that (denoting by C the n x d matrix made up from the $C_t$), C is the
best rank k approximation to A. Each server uses polynomial time and the total communication is
$O(sd^2)$ real numbers.
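
A minimal pure-Python sketch of the First Algorithm for the special case k = 1, with power iteration standing in for the SVD at server 1. The restriction to k = 1, the helper names (`gram`, `add`, `top_vector`, `project`) and the example matrices are assumptions made to keep the illustration short.

```python
import math


def gram(rows, d):
    """A_t^T A_t for the rows held by one server (a d x d matrix)."""
    return [[sum(row[a] * row[b] for row in rows) for b in range(d)]
            for a in range(d)]


def add(X, Y):
    """Entrywise sum of two matrices, as server 1 would do on receipt."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def top_vector(B, iters=200):
    """Top eigenvector of a nonzero symmetric PSD matrix B by power iteration."""
    d = len(B)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(B[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, v):
    """Rank-1 case of C_t = A_t v v^T, computed locally by server t."""
    return [[sum(x * y for x, y in zip(row, v)) * c for c in v] for row in rows]


A1 = [[3, 0], [0, 1]]   # rows held by server 1
A2 = [[3, 0], [4, 0]]   # rows held by server 2
B = add(gram(A1, 2), gram(A2, 2))   # equals A^T A for the stacked matrix A
v = top_vector(B)                    # broadcast to all servers
C1 = project(A1, v)
```

Only the d x d Gram matrices and the vector v cross the network, which is the source of the $O(sd^2)$ communication bound.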

The Second Algorithm 

• For t = 1, 2, ..., s, server t (in parallel) does a (truncated) SVD of $A_t$ to find an $n_t \times k$ matrix
$P_t$ and a $k \times d$ matrix $R_t$ such that $P_t R_t$ is the best rank k approximation to $A_t$. [The rows
of $R_t$ are the right singular vectors of $A_t$.]

• Server t communicates $R_t$ and $P_t^T P_t$ to server 1.

• Server 1 finds $B = \sum_{t=1}^{s} R_t^T P_t^T P_t R_t$. It does the SVD of B to find the top k singular vectors
$v_1, v_2, \ldots, v_k$ of B. It communicates these vectors to all servers.

• Server t computes $C_t = A_t \sum_{i=1}^{k} v_i v_i^T$.

Theorem 4.3 The matrix C (comprising the $C_t$ found by the algorithm) satisfies:

$$\|A - C\|_F \le 3 f_k(A).$$

The algorithm uses polynomial time and communicates $O(skd)$ real numbers.

Proof. Define an n x d matrix W:

$$W = \begin{pmatrix} P_1 R_1 \\ P_2 R_2 \\ \vdots \\ P_s R_s \end{pmatrix}.$$

We have

$$\|A - W\|_F^2 = \sum_{t=1}^{s} \|A_t - P_t R_t\|_F^2 = \sum_{t=1}^{s} f_k(A_t)^2 \le f_k(A)^2, \qquad (2)$$


the last because if A' is the best rank k approximation to the full matrix A and we partition the rows
of A' into $A'_1, A'_2, \ldots, A'_s$ as for A, then $f_k(A_t) \le \|A_t - A'_t\|_F$ and $\sum_{t=1}^{s} \|A_t - A'_t\|_F^2 = \|A - A'\|_F^2$.
Let V be the d x k matrix whose columns are the top k right singular vectors of W. We will
see later a nimble algorithm to compute V. Assume for now V is known. Let A' be the best rank
k approximation to A.

$$\|A - WVV^T\|_F \le \|A - W\|_F + \|W - WVV^T\|_F \le \|A - W\|_F + \|W - A'\|_F \le \|A - W\|_F + \|W - A\|_F + \|A - A'\|_F \le 3 \cdot f_k(A), \qquad (3)$$

where, for the second inequality, we have used the fact that $\|W - WVV^T\|_F = \min_{X : \mathrm{rank}(X) \le k} \|W - X\|_F \le \|W - A'\|_F$,
and for the last inequality, we have used (2). Since each row of $AVV^T$ is the projection of the
corresponding row of A onto the span of the columns of V, which contains the rows of $WVV^T$, we also have
$\|A - AVV^T\|_F \le \|A - WVV^T\|_F$.

Now to find V nimbly: Let X be a general d x k matrix whose columns form an orthonormal
set of vectors. [As is well known, the X which maximizes $\|WXX^T\|_F$ is V.] We have:

$$\|WXX^T\|_F^2 = \sum_{t=1}^{s} \|P_t R_t XX^T\|_F^2 = \sum_{t=1}^{s} \mathrm{Tr}\big( XX^T R_t^T P_t^T P_t R_t XX^T \big) = \mathrm{Tr}\Big( XX^T \Big( \sum_{t=1}^{s} R_t^T P_t^T P_t R_t \Big) XX^T \Big) = \mathrm{Tr}\big( XX^T B XX^T \big).$$

It follows that the algorithm correctly computes V and this finishes the proof of correctness. The
communication bound is also simple: Only $R_t$ (each kd reals), $P_t^T P_t$ (each $k^2$ reals) and $v_1, v_2, \ldots, v_k$
(kd reals) are communicated (per server). We may assume without loss of generality that $k \le d$,
since otherwise, we can just keep $C_t = A_t$ and meet the requirements of the theorem. So $k^2 \in O(kd)$
and the theorem is proved. $\Box$

It is an interesting open question to improve the factor of 3 in the theorem, hopefully to $1 + \epsilon$.
Proof (of Thm. 1.4). To prove this theorem, we have to achieve better communication efficiency
for sparse matrices. In this case, we use the algorithm of Boutsidis, Drineas and Mahoney [10] (improving
on a line of work [19] [21] [20] [33] [11]) for row/column based relative error low-rank
matrix approximation. They have shown that from $A_t$, we can find O(k) rows of $A_t$ so that in their
span, there is an approximation $D_t$ to $A_t$ so that $\|A_t - D_t\|_F \le 2 \cdot f_k(A_t)$. Now this algorithm will
be essentially the same as the second, except, now $R_t$ will be the O(k) rows of $A_t$ found above. We
defer the details. Now the communication is O(sk) actual rows of A and $P_t^T P_t$ (which is $O(k^2)$ real
numbers), and achieves a factor 5 approximation. This can be reduced to $3 + \epsilon$, by using $O(k/\epsilon)$
rows of $A_t$ in place of $R_t$ and getting

$$\|A_t - D_t\|_F \le \Big( 1 + \frac{\epsilon}{2} \Big) f_k(A_t).$$

Using the same analysis as in the proof of Theorem 4.3, we get the approximation factor of
$1 + (\epsilon/2) + 1 + (\epsilon/2) + 1 = 3 + \epsilon$. $\Box$

4.2 Arbitrary Partition 

The crucial point of the nimble algorithm for finding a rank k approximation to A with relative
error $\epsilon$ in the arbitrary partition model will be a random "projection matrix" P. P will be an r x n
matrix, where $r \in O(d)$, and we will do computations on PA instead of on A. For this to work, we
will need that for all vectors $x \in \mathbb{R}^d$, $|PAx| \approx |Ax|$. Note that of course, it suffices to prove that for all
unit length vectors y in the d-dimensional space spanned by the columns of A, we have $|Py| \approx 1$.

The entries of P will be random ±1. If they are completely independent, we can have a theorem
very similar to the Johnson-Lindenstrauss theorem [3l[Il[5], but complete independence has a high
space requirement for storage and communication. At the other extreme, the paper of Alon et
al. [3] uses (in effect) P with mutually independent rows, but only, say, 4-way independence inside
each row to prove for one vector v that $|Pv| \approx |v|$. We need here to prove this for exponentially
(in d alone) many vectors (namely all vectors in an $\epsilon$-net of the column space of A). For
this, we will use higher-order independence (but not full independence). The proof that the failure
probability is exponentially (in d) low for each vector is also more delicate than usual, since as
we point out, the usual Hoeffding inequality fails. So, we choose here to give a from-first-principles
description and rigorous proofs of the non-standard parts.

The projection matrix P will have the following properties: 

1. P is r x n, where r is in $\Omega(d/\epsilon^2)$.

2. Each entry of P is ±1.

3. The entries in each row of P are m-way independent, where $m = \Omega(d \log(2/\epsilon))$. I.e., for
each i, and distinct $j_1, j_2, \ldots, j_m$ and $a_1, a_2, \ldots, a_m \in \{-1, +1\}$,

$$\mathrm{Prob}\big( P_{ij_1} = a_1; P_{ij_2} = a_2; \ldots; P_{ij_m} = a_m \big) = 2^{-m}.$$

4. The r rows of P are mutually independent. I.e., for any r vectors $v_1, v_2, \ldots, v_r \in \{-1, +1\}^n$,

$$\mathrm{Prob}(P_1 = v_1; P_2 = v_2; \ldots; P_r = v_r) = \mathrm{Prob}(P_1 = v_1)\,\mathrm{Prob}(P_2 = v_2) \cdots \mathrm{Prob}(P_r = v_r).$$

Lemma 4.4 Assume $n$ is a power of 2 and let $F$ be the finite field with $n$ elements, where each
element is represented as a $\log_2 n$ bit string. Let $p_1, p_2, \ldots, p_r$ be $r$ polynomials, each of degree $m-1$,
whose coefficients are all picked uniformly and mutually independently at random from $F$. Define
$P$ as follows: $P_{ij}$ is the leading bit of the $k$-bit binary expansion (with $k = \log_2 n$) of $p_i(j)$ (polynomial $p_i$ evaluated
at $j$ over $F$), read as $\pm 1$. Then,

• $P$ has properties (1) through (4) above.

• The $p_i$ can all be communicated with $O(d^2 \log n \log(1/\epsilon)/\epsilon^2)$ bits.

The proof of the first part is routine and will be given in the full paper. The second part is 
obvious. 
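
The construction in Lemma 4.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it fixes the toy size $n = 16$, represents $F = GF(2^4)$ with the irreducible polynomial $x^4 + x + 1$, and maps a leading bit of 1 to $+1$ and 0 to $-1$. The names `gf_mul`, `poly_eval`, `leading_bit_sign` and `projection_matrix` are hypothetical helpers, not part of the paper.

```python
import random

# Assumption: K = 4, so n = 16, and GF(2^4) is defined by x^4 + x + 1.
# Any irreducible polynomial of degree K over GF(2) would serve equally well.
K = 4
N = 1 << K
IRREDUCIBLE = 0b10011


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^K), each encoded as a K-bit integer."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^K) is XOR
        b >>= 1
        a <<= 1
        if a & N:                # degree reached K: reduce modulo IRREDUCIBLE
            a ^= IRREDUCIBLE
    return result


def poly_eval(coeffs: list[int], x: int) -> int:
    """Evaluate a polynomial over GF(2^K) at x by Horner's rule.

    coeffs[0] is the coefficient of the highest power.
    """
    acc = 0
    for c in coeffs:
        acc = gf_mul(acc, x) ^ c
    return acc


def leading_bit_sign(value: int) -> int:
    """Map the leading bit of the K-bit expansion to +1 (bit 1) or -1 (bit 0)."""
    return 1 if (value >> (K - 1)) & 1 else -1


def projection_matrix(r: int, m: int, rng: random.Random) -> list[list[int]]:
    """Build an r x N sign matrix; row i uses a random polynomial with m coefficients."""
    rows = []
    for _ in range(r):
        coeffs = [rng.randrange(N) for _ in range(m)]   # degree at most m - 1
        rows.append([leading_bit_sign(poly_eval(coeffs, j)) for j in range(N)])
    return rows


P = projection_matrix(r=3, m=2, rng=random.Random(0))
```

Only the $r \cdot m$ field elements in `coeffs` need to be sent to the other servers; each server can then regenerate every entry of $P$ locally.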

Now we prove that, with high probability, for every $x \in \mathbb{R}^d$ simultaneously, $\|PAx\|^2$
is an estimate of $r\|Ax\|^2$ with relative error $\epsilon$.

Lemma 4.5

$$\mathrm{Prob}\left(\exists x \in \mathbb{R}^d : \|PAx\|^2 \notin \left((1-\epsilon)\, r \|Ax\|^2,\ (1+\epsilon)\, r \|Ax\|^2\right)\right) \le c_1 e^{-c_2 d}.$$

Proof. The proof starts with an $\epsilon$-net with $\epsilon^{-O(d)}$ elements for the set of unit-length vectors $y$ in the
column space of $A$. This space is $d$-dimensional, and so it is clear that such an $\epsilon$-net exists. Now
for each $y$ in the net, we wish to prove that the probability that $\|Py\|^2 \notin ((1-\epsilon)r, (1+\epsilon)r)$ is
exponentially small in $d$. We point out that the standard Hoeffding inequality will not work, since $|P_i \cdot y|$ could be (in
the worst case) as much as $\sqrt{n}$, since $|P_i| = \sqrt{n}$ and $|y| = 1$. Inequalities exploiting finite (rather
than exponential) moments of $|P_i \cdot y|$ are necessary. To this end, we note that for any $p < m$, $p$ even,
we have

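$$E\left((P_i \cdot y)^p\right) = \sum_{j_1, j_2, \ldots, j_p} E\left(P_{ij_1} P_{ij_2} \cdots P_{ij_p}\right) y_{j_1} y_{j_2} \cdots y_{j_p},$$

where $j_1, j_2, \ldots, j_p$ are any $p$ indices (not necessarily distinct) from $\{1, 2, \ldots, n\}$. Terms where any
$j \in \{1, 2, \ldots, n\}$ occurs an odd number of times are zero using $m$-way independence. Also, $P_{ij}$ raised to an
even power is 1. So it is easy to see that

$$E\left((P_i \cdot y)^p\right) = \sum_{\ell} \ \sum_{j_1 < j_2 < \cdots < j_\ell} \ \sum_{(d_1, d_2, \ldots, d_\ell)} \binom{p}{d_1, d_2, \ldots, d_\ell}\, y_{j_1}^{d_1} y_{j_2}^{d_2} \cdots y_{j_\ell}^{d_\ell},$$

where the $d_t$'s are even positive integers summing to $p$. Note that we have

$$\binom{p}{d_1, d_2, \ldots, d_\ell} \le p^{p/2} \binom{p/2}{d_1/2,\ d_2/2,\ \ldots,\ d_\ell/2}.$$

So we get

$$E\left((P_i \cdot y)^p\right) \le p^{p/2}\, |y|^p.$$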

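Put $X_i = (P_i \cdot y)^2 - E\left((P_i \cdot y)^2\right)$ for $i = 1, 2, \ldots, r$. Then $E X_i = 0$ and $\mathrm{Var}(X_i) = O(1)$. By the above,

$$E\left(X_i^p\right) \le 2^p\, E\left((P_i \cdot y)^{2p}\right) + 2^p \le (cp)^p$$

for a constant $c$. Now taking $m$ to be even and applying Theorem 1 from [28], we get:

$$E\left(\left|\,\|Py\|^2 - r\,\right|^m\right) = E\left(\Big(\sum_{i=1}^{r} X_i\Big)^m\right) \le (crm)^{m/2}.$$

Using Markov's inequality, we therefore get

$$\mathrm{Prob}\left(\left|\,\|Py\|^2 - r\,\right| \ge \epsilon r\right) \le \frac{E\left(\left|\,\|Py\|^2 - r\,\right|^m\right)}{(\epsilon r)^m} \le \frac{(crm)^{m/2}}{\epsilon^m r^m} = \left(\frac{cm}{\epsilon^2 r}\right)^{m/2},$$

which is exponentially small in $d$ for our choice of $r = Cm/\epsilon^2$ and $m = Cd\log(2/\epsilon)$.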

Now since there are only $(2/\epsilon)^d$ elements in the $\epsilon$-net, a union bound shows that this holds
simultaneously for all $y$ in the net.

Suppose now $z$ is any unit vector (not necessarily in the net). Then, by repeated approximation,
we can express $z = y_1 + a_2 y_2 + a_3 y_3 + \cdots$, where $y_1, y_2, \ldots$ are unit vectors in the $\epsilon$-net and
$a_k \le \epsilon^{k-1}$. From this, the Lemma follows in a standard fashion. □
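
As a sanity check on the moment bound used in the proof, read here as $E((P_i \cdot y)^p) \le p^{p/2}$ for a unit vector $y$, the following Python sketch computes the moments exactly by enumerating all sign patterns. It assumes a fully independent $\pm 1$ row of toy length $n = 8$, which has the same $p$-th moments as an $m$-way independent row whenever $p \le m$. The helper `exact_moment` is hypothetical.

```python
import itertools
import math
import random


def exact_moment(y: list[float], p: int) -> float:
    """E((P_i . y)^p) for a row of fully independent uniform +/-1 signs, by enumeration."""
    n = len(y)
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        total += sum(s * v for s, v in zip(signs, y)) ** p
    return total / 2 ** n


rng = random.Random(1)
raw = [rng.gauss(0.0, 1.0) for _ in range(8)]
norm = math.sqrt(sum(v * v for v in raw))
y = [v / norm for v in raw]                      # unit vector, |y| = 1

moments = {p: exact_moment(y, p) for p in (2, 4, 6)}
bounds = {p: p ** (p / 2) for p in (2, 4, 6)}    # the bound p^(p/2)

for p in (2, 4, 6):
    print(f"p={p}: moment={moments[p]:.4f}  bound={bounds[p]:.1f}")
```

The second moment equals $|y|^2 = 1$, and the higher even moments stay well below $p^{p/2}$, which is the slack that the finite-moment argument exploits.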

We now state the complete algorithm. 
• Server 1 generates the coefficients of the polynomials $p_1, p_2, \ldots, p_r$ as described in Lemma
4.4. It communicates the coefficients of these polynomials to all servers.

• Server $t$ uses the polynomials to compute the matrix $P$ as in Lemma 4.4. It computes $PA_t$
and communicates it to server 1.

• Server 1 computes $\sum_{t=1}^{s} PA_t = PA$. It does SVD on $PA$ to find the top $k$ right singular vectors
$v_1, v_2, \ldots, v_k$ and communicates them to all servers.

• Server $t$ computes $C_t = A_t \sum_{i=1}^{k} v_i v_i^T$.
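
The four steps above can be simulated on one machine with NumPy. This is a minimal sketch under stated assumptions, not the paper's implementation: it uses fully independent signs in place of the limited-independence matrix of Lemma 4.4, scales $P$ by $1/\sqrt{r}$, lets a shared seed stand in for the communicated polynomial coefficients, and picks toy sizes. The function name `nimble_rank_k` is hypothetical.

```python
import numpy as np


def nimble_rank_k(parts: list[np.ndarray], k: int, r: int, seed: int = 0):
    """Simulate the protocol for A = sum of parts (arbitrary partition)."""
    n, d = parts[0].shape
    # Step 1: a shared seed stands in for the coefficients sent by server 1.
    rng = np.random.default_rng(seed)
    # Assumption: fully independent signs, scaled by 1/sqrt(r), rather than
    # the m-way independent construction of Lemma 4.4.
    P = rng.choice([-1.0, 1.0], size=(r, n)) / np.sqrt(r)
    # Step 2: server t sends the r x d sketch P @ A_t to server 1.
    sketches = [P @ A_t for A_t in parts]
    # Step 3: server 1 adds the sketches and takes the top k right singular vectors.
    PA = sum(sketches)
    _, _, Vt = np.linalg.svd(PA, full_matrices=False)
    V = Vt[:k].T                          # d x k, broadcast to all servers
    # Step 4: server t keeps C_t = A_t V V^T; the C_t sum to A V V^T.
    return V, [A_t @ V @ V.T for A_t in parts]


rng = np.random.default_rng(42)
n, d, k, s = 3000, 10, 2, 3
signal = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) * 5.0
A = signal + rng.normal(size=(n, d))
pieces = [rng.normal(size=(n, d)) for _ in range(s - 1)]
parts = pieces + [A - sum(pieces)]        # arbitrary partition: the parts sum to A

V, C = nimble_rank_k(parts, k=k, r=1000)
err = np.linalg.norm(A - sum(C), "fro") ** 2
best = float(np.sum(np.linalg.svd(A, compute_uv=False)[k:] ** 2))
print(f"error / optimal rank-k error = {err / best:.3f}")
```

Each server sends only an $r \times d$ sketch and receives $k$ vectors of length $d$, so the communication does not grow with $n$ apart from the description of $P$.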

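We are now ready for the main theorem for an arbitrary partition.

Proof (of Theorem 1.5). Form an orthonormal basis of $\mathbb{R}^d$ using the right singular vectors of $PA$.
Let $v_1, v_2, \ldots, v_d$ be the basis. Then

$$\Big\|A - A\sum_{i=1}^{k} v_i v_i^T\Big\|_F^2 = \sum_{i=k+1}^{d} |Av_i|^2 \le (1+\epsilon) \sum_{i=k+1}^{d} |PAv_i|^2 = (1+\epsilon)\, f_k(PA)^2.$$

Also, suppose now $u_1, u_2, \ldots, u_d$ is an orthonormal basis consisting of the right singular vectors of $A$.
Then, we have

$$f_k(PA)^2 \le \Big\|PA - PA\sum_{i=1}^{k} u_i u_i^T\Big\|_F^2 = \sum_{i=k+1}^{d} |PAu_i|^2 \le (1+\epsilon) \sum_{i=k+1}^{d} |Au_i|^2 = (1+\epsilon)\, f_k(A)^2.$$

Thus,

$$\Big\|A - A\sum_{i=1}^{k} v_i v_i^T\Big\|_F^2 \le (1+\epsilon)^2 f_k(A)^2,$$

and the theorem follows. □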

In contrast, in the Streaming model, even with multiple passes, the problem cannot be solved 
with polylog space. 

Theorem 4.6 (Theorem 4.14 of pMj) For any $k$, $1 \le k \le \min\{n, d\}$, and any $\epsilon > 0$, any multi-pass
algorithm for the rank-$k$ approximation problem, with input presented in arbitrary order, that has
probability of error at most 1/3 must use $\Omega((n + d)k\log(nd))$ bits of space.

5 Clustering 

In this section, we prove Theorem 1.6. Resource (time and communication) bounds will be given
in terms of $n$, $d$, $k$, which can take on any values. Similar to low-rank approximation, our main
interest here is when $n \gg d$, with $n, d \to \infty$ but $k \in O(1)$.

First, a note of caution. For data points generated from a spherical Gaussian of standard
deviation 1 in each direction, data points are at distance $O(\sqrt{d} \pm 1)$ from the mean. So, using data
points as cluster centers (as in, say, the well-known $k$-means++ heuristic [6]) will not be a valid
approximation. Also, a $(1 + \epsilon)$ approximation to the optimal $k$-means clustering (or to the
optimal $k$-median clustering) does not guarantee a valid approximation to $\mathcal{C}$. A simple example of
this is when data points are generated by a mixture of two spherical Gaussians in $\mathbb{R}^d$ with variance
1 in every direction and their means separated by $\Omega(1)$. It is easy to see that the clustering $\mathcal{C}$
dividing the data points according to which Gaussian they were generated from is proper. But the
mean-squared distance of a data point to a cluster center is $d$, and so the near-optimal clustering
could have an error of $\epsilon d$. So near-optimality does not rule out the cluster centers found being $\Omega(\sqrt{\epsilon d})$
away from the centers in $\mathcal{C}$! In particular, this means that performing a random projection up
front that preserves all pairwise distances to within relative error $(1 + \epsilon)$ does not suffice.
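
The concentration behind this caution is easy to observe numerically. The sketch below, with assumed toy sizes $d = 400$ and $n = 2000$, samples points from a spherical Gaussian and measures their distance to the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 400, 2000
points = rng.normal(size=(n, d))              # spherical Gaussian: mean 0, std 1 per coordinate
dist_to_mean = np.linalg.norm(points, axis=1)

print(f"sqrt(d)          = {np.sqrt(d):.2f}")
print(f"distance to mean = {dist_to_mean.mean():.2f} +/- {dist_to_mean.std():.2f}")
print(f"closest point    = {dist_to_mean.min():.2f}")
```

Every sampled point sits at distance close to $\sqrt{d}$ from the mean, so no data point is a good stand-in for the true center.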

In contrast to the remark above that doing near-optimal clustering in $\mathbb{R}^d$ does not necessarily
yield a valid approximation, it is known that a constant factor approximation to the optimal
$k$-means clustering in the projection to the $k$-dimensional SVD subspace does give us a valid
approximation:

Lemma 5.1 (Claim 3.3 of [29] and Lemma 5.1 of fSUj) Let $V$ be the space spanned by the top $k$
(right) singular vectors of $A$. A constant factor approximation to the optimal $k$-means clustering
of the rows of $A$ projected to $V$ is a valid approximation to $\mathcal{C}$.

The nimble algorithm for clustering the projected points will also crucially use the important
result of Chen [15] on "core-sets". Chen gave the first construction of a coreset of size polynomially
bounded in dimension.

Theorem 5.2 [15] For any set $W$ of $n$ points in $\mathbb{R}^d$ to be clustered into $k$ clusters, in
polynomial time, we can find a weighted subset (called a coreset) $X$ of $O(dk^2)$ points among them
such that for any set $Y$ of $k$ centers, the cost of clustering $W$ with centers $Y$ is within a constant
factor of the cost of clustering $X$ with $Y$ as centers (according to the $k$-means objective).

Using the above two properties, and the low-rank approximation of the previous section, we 
will prove that clustering can be achieved by a nimble algorithm. 

Proof (of Theorem 1.6). The algorithm will project the points to their $k$-dimensional SVD
subspace $V$, then cluster the projected points in the SVD subspace. It follows from Lemma 5.1
that this will give a valid clustering. What remains is to make the two parts (finding $V$ and
clustering the projected points) nimble. The first is already done by the first algorithm of the last
section. At the end of that algorithm, each server has the top $k$ right singular vectors of the whole
matrix $A$ and so can do the projection of its data points to $V$. But we still cannot communicate
the $n$ projected points to figure out a near-optimal clustering.

We now apply Chen's result to the points in the SVD subspace $V$, so the coreset $X$ given
by his theorem has size $O(k^3)$ only. So one could $k$-cluster $X$ instead of the full point set $W$.
Further, Chen's algorithm can be made nimble provided each server already has a constant factor
approximation to the optimal $k$-clustering of $W$, namely, provided each server has the same set
$Y'$ of $O(k)$ centers so that the cost of clustering $W$ with $Y'$ as centers is within a constant factor
of the optimal $k$-means cost for $W$. We describe briefly Chen's algorithm and how it can be made
nimble. The algorithm partitions $W$ into $W_1, W_2, \ldots$, where $W_i$ is the set of points in $W$ with the
$i$th point of $Y'$ (call this $y_i$) as its closest point in $Y'$. Then, we partition $W_i$ further into "rings",
namely, we let

$$W_{ij} = \left\{x \in W_i : |x - y_i| \in \left(R \cdot 2^{j-1},\ R \cdot 2^{j}\right)\right\},$$

for a suitable $R$. Then, the algorithm picks uniformly at random a certain number of points in each
$W_{ij}$, and together these form the coreset. To make this nimble, with $Y'$ on hand, each server
can find its part of each $W_{ij}$ and communicate only its cardinality. Based on the cardinalities, each
server randomly draws a certain number of points from its part of $W_{ij}$ to be included in the coreset
and communicates them to all the servers, at the end of which all servers have the coreset.
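
The two communication rounds just described can be sketched as follows. This is an illustration under stated assumptions rather than Chen's exact procedure: the radius `R` and the per-ring sample budget `per_ring` are hypothetical parameters, an extra ring 0 collects all points closer than `R` to their center, each server's quota is proportional to its share of the ring, and every sampled point is weighted by $|W_{ij}|$ divided by the number of points sampled from $W_{ij}$.

```python
import math
import numpy as np


def ring_index(dist: float, R: float) -> int:
    """Ring 0 holds points within R of the center; ring j >= 1 holds [R 2^(j-1), R 2^j)."""
    return 0 if dist < R else int(math.floor(math.log2(dist / R))) + 1


def local_rings(points, centers, R):
    """One server: group its points by (closest center i, ring j)."""
    rings = {}
    for x in points:
        dists = np.linalg.norm(centers - x, axis=1)
        i = int(np.argmin(dists))
        rings.setdefault((i, ring_index(float(dists[i]), R)), []).append(x)
    return rings


def nimble_coreset(server_points, centers, R, per_ring, seed=0):
    """Return (points, weights); only counts and sampled points are communicated."""
    rng = np.random.default_rng(seed)
    local = [local_rings(pts, centers, R) for pts in server_points]
    # Round 1: every server announces the cardinality of its part of each W_ij.
    totals = {}
    for rings in local:
        for key, members in rings.items():
            totals[key] = totals.get(key, 0) + len(members)
    # Round 2: every server samples from its part in proportion to its share.
    samples = {key: [] for key in totals}
    for rings in local:
        for key, members in rings.items():
            quota = min(len(members), math.ceil(per_ring * len(members) / totals[key]))
            chosen = rng.choice(len(members), size=quota, replace=False)
            samples[key].extend(members[c] for c in chosen)
    # Each sampled point stands for |W_ij| / (number sampled from W_ij) points.
    coreset, weights = [], []
    for key, pts in samples.items():
        coreset.extend(pts)
        weights.extend([totals[key] / len(pts)] * len(pts))
    return np.array(coreset), np.array(weights)


rng = np.random.default_rng(3)
servers = [rng.normal(size=(200, 2)) + shift for shift in ([0, 0], [6, 0], [0, 6])]
Y = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])    # hypothetical shared centers Y'
X, w = nimble_coreset(servers, Y, R=0.5, per_ring=10)
print(len(X), "coreset points with total weight", w.sum())
```

With this weighting the total weight of the coreset equals the number of input points, while the number of points exchanged depends only on the number of rings and the sample budget.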

It now remains only to see that we can nimbly compute a set $Y'$ as above. But this is
straightforward: each server just finds a constant factor approximation to the optimal $k$-clustering of its
own data points and communicates these centers. $Y'$ will simply be the union of all these centers.
To summarize, here is the algorithm.

Clustering Algorithm 

• The servers first use the first algorithm of the last section (communicating $A_t^T A_t$ to server
1, which then finds the top $k$ singular vectors of $\sum_{t=1}^{s} A_t^T A_t$, which are also the top $k$
right singular vectors of $A$, and communicates them to all servers) to find the top $k$ singular vectors
$v_1, v_2, \ldots, v_k$ of $A$. Let $V$ be the span of $v_1, v_2, \ldots, v_k$.

• Server $t$ projects the rows of $A_t$ onto $V$.

• Server $t$ finds a factor 2 approximation to the optimal $k$-means clustering of its projected
points.

• Server $t$ broadcasts the $k$ centers found in the last step to all servers. So now all servers know
the set $Y'$ of $sk$ centers found; let $Y' = \{a_1, a_2, \ldots, a_{sk}\}$.

• Use the algorithm of Chen as described above.

□
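
The first four steps of the Clustering Algorithm can be sketched with NumPy as below. The sketch assumes the rows of $A$ are split into blocks $A_t$, one per server, and it uses plain Lloyd iterations as a stand-in for the local clustering step, so it does not carry the factor 2 guarantee the algorithm asks for. The names `top_k_subspace`, `local_centers` and `candidate_centers` are hypothetical.

```python
import numpy as np


def top_k_subspace(row_blocks, k):
    """Server 1's step: top k right singular vectors of A from the d x d matrices A_t^T A_t."""
    gram = sum(A_t.T @ A_t for A_t in row_blocks)     # equals A^T A for row blocks of A
    _, eigvecs = np.linalg.eigh(gram)                 # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]                    # d x k


def local_centers(points, k, rng, iters=20):
    """Stand-in local clustering: plain Lloyd iterations (no factor-2 guarantee)."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        sq_dists = ((points[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(sq_dists, axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return centers


def candidate_centers(row_blocks, k, seed=0):
    """Return V and the union Y' of the s*k centers found by the servers."""
    rng = np.random.default_rng(seed)
    V = top_k_subspace(row_blocks, k)
    projected = [A_t @ V for A_t in row_blocks]       # coordinates of the rows in V
    Y = np.vstack([local_centers(pts, k, rng) for pts in projected])
    return V, Y


rng = np.random.default_rng(7)
means = rng.normal(size=(3, 20)) * 10.0
blocks = [means[rng.integers(0, 3, size=300)] + rng.normal(size=(300, 20)) for _ in range(4)]
V, Y = candidate_centers(blocks, k=3)
print(V.shape, Y.shape)
```

Each server sends a $d \times d$ matrix and $k$ centers of dimension $k$, so the communication in these steps is independent of the number of points $n$; the set `Y` would then be fed to the coreset step.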



6 Conclusion 

We have presented algorithms and analysis for frequency moments, counting homomorphisms,
low-rank approximation and clustering, in that order. The model and results raise several interesting
questions: (1) Privacy. In the distributed setting, it is often the case that a computation must
be done privately [22], [2]. Which problems have privacy-preserving nimble algorithms? (2) Graph
problems. Do basic graph problems such as finding shortest paths or finding large matchings have
nimble algorithms? (3) What is the class of homomorphisms that can be approximately counted
with nimble algorithms? (4) Can we prove lower bounds on sampling from natural distributions
with nimble algorithms (e.g., sampling an item according to its $k$-th frequency moment)?

In a practical cloud set-up, there might be some control over where each piece of data is 
stored. However, allocating optimally to achieve best "run-time" efficiency is clearly very hard. We 
assumed that the partition of data into processors is adversarial here, but there is another extreme 
possibility: namely, each piece of input data is assigned on arrival to a uniformly randomly picked 
server. This random partition model may allow more problems to be solved by nimble algorithms. 

The communication complexity of our rank-$k$ approximation algorithm is $O^*(sd^2/\epsilon^2)$. David
Woodruff showed us that using a 2-round algorithm, the first term can be improved to $O(skd/\epsilon^2)$.
Jelani Nelson observed that we only need $O(\log d)$-wise independence (rather than $O(d)$-wise
independence), reducing the number of bits to communicate the projection matrix to $O(1/\epsilon^2)$.

Acknowledgements. We are grateful to Dick Karp, David Woodruff and Jelani Nelson for
helpful discussions.